Abstract

A Bayesian nonparametric approach to the study of species diversity, based on
choosing a random discrete distribution as a prior model for the unknown relative
abundances of species, has recently been introduced in Lijoi et al. (2007, 2008).
Explicit posterior predictive estimation of species richness has been obtained
under priors belonging to the $\alpha$-Gibbs class (Gnedin & Pitman, 2006). Here we
focus on posterior estimation of species evenness, which accounts for diversity in
terms of the proximity to a uniform distribution of the population across species.
We consider Simpson's index and provide a Bayesian estimator under quadratic loss,
together with its variance, under some specific $\alpha$-Gibbs priors.



Introduction 



Species diversity has two components: species richness, the number of different
species belonging to a population, and species evenness, the distance of the actual
relative abundances from a uniform distribution of the population across species.
In other words, a population is more diverse the richer and the more uniform it is.
The measurement of the diversity of populations whose individuals are classified
into groups has a long history, dating back to Simpson's (1949) and Fisher's (1943)
seminal papers. Since then the ecological literature has produced a large number of
results and a variety of indexes to measure both aspects of diversity. For
nonparametric estimation of richness, relevant recent approaches include Chao & Lee
(1992), Chao & Bunge (2002) and Shen et al. (2003), while a Bayesian approach is in
Hill (1979) and Boender & Rinnooy Kan (1987). For nonparametric estimation of
evenness, results mostly concentrate on Simpson's (1949) index and the
Shannon-Wiener index (Pielou, 1975); see e.g. Lloyd & Ghelardi (1964) and Chao &
Shen (2003). A Bayesian nonparametric estimation of Shannon's index under Fisher's
(1943) model is in Gill & Joanes (1979).

Recently a Bayesian nonparametric approach to species diversity has been introduced
in Lijoi et al. (2007, 2008) by specifying a prior on the space of unknown relative
abundances and deriving posterior results given the information contained in the
multiplicities of the first $k$ species observed in a basic $n$-sample. Posterior
predictive estimation of species richness has been obtained under priors belonging
to the class of random discrete distributions inducing infinite exchangeable
partitions in Gibbs form of type $\alpha$, as devised in Gnedin & Pitman (2006).
Asymptotic results for the behaviour of a proper normalization of the number of new
species, when the size of the additional sample tends to infinity, have been
provided under the same approach (Favaro et al. 2009, 2011; Cerquetti, 2011).

Here, under the same setting, we propose a Bayesian nonparametric posterior
estimation of the evenness of a population of species by focusing on Simpson's
index of species diversity (Simpson, 1949), which measures evenness as the
probability that two elements picked at random belong to different species. We
derive an explicit posterior estimator under quadratic loss, with its variance,
under the two-parameter $(\alpha, \theta)$ Poisson-Dirichlet model (Pitman, 1995;
Pitman and Yor, 1997), and consequently derive explicit formulas for three
particular cases contained in this model: the Dirichlet-Ewens (1972) sampling model
($\alpha = 0$), the Stable model ($\theta = 0$) and Fisher's model (1943)
($\theta = \xi|\alpha|$ for $\alpha < 0$ and $\xi \in \mathbb{N}^{+}$). The paper is
organized as follows. In Section 1 we recall some preliminaries about
$\alpha$-Gibbs exchangeable partitions and Simpson's index. In Section 2 we locate
the evenness parameter by providing the prior estimator of Simpson's index under
some relevant $\alpha$-Gibbs partition models. In Section 3 we obtain the general
Bayesian nonparametric posterior estimator under the two-parameter
$(\alpha, \theta)$ model and its posterior variance, and derive the explicit
posterior results for the particular cases treated in Section 2. Finally, in
Section 4 we suggest a feasible generalization of the results contained in the
present paper.

1 Preliminaries 

Recall that the exchangeable partition probability function (EPPF) describing the
random behaviour of a partition of the positive integers obtained by sampling from
a random discrete distribution $(P_i)_{i \geq 1}$ belonging to the $\alpha$-Gibbs
class, for $\alpha \in (-\infty, 1)$ (Gnedin and Pitman, 2006), is characterized by
the product form

$$p(n_1, \ldots, n_k) = V_{n,k} \prod_{j=1}^{k} (1 - \alpha)_{n_j - 1} \tag{1}$$

where the $V_{n,k}$ are coefficients satisfying the recursive relation
$V_{n,k} = (n - k\alpha) V_{n+1,k} + V_{n+1,k+1}$, and $(x)_y$ is the usual notation
for rising factorials, $(x)_y = x(x+1) \cdots (x+y-1)$. Three convex subclasses of
$\alpha$-Gibbs partitions have been identified by Gnedin & Pitman by deriving the
extreme partitions corresponding to different values of $\alpha$, namely: mixtures
of Fisher's models for $\alpha < 0$, mixtures of Ewens-Dirichlet $(\theta)$ models
for $\alpha = 0$, and mixtures of conditional Poisson-Kingman models driven by the
$\alpha$-stable subordinator (Pitman, 2003) for $\alpha \in (0, 1)$.
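The recursion on the $V_{n,k}$ can be checked numerically on a concrete instance. The sketch below is illustrative only: it takes the two-parameter Poisson-Dirichlet weights $V_{n,k} = (\theta+\alpha)_{k-1\uparrow\alpha}/(\theta+1)_{n-1}$ as the example, and the function names `rising`, `v_pd` and `recursion_holds` are hypothetical helpers, not part of the paper. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction


def rising(x, n, step=1):
    """Generalized rising factorial x (x+step) ... (x+(n-1)step); equals 1 for n = 0."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i * step
    return out


def v_pd(n, k, alpha, theta):
    """Assumed example weights: V_{n,k} of the PD(alpha, theta) model."""
    return rising(theta + alpha, k - 1, alpha) / rising(theta + 1, n - 1)


def recursion_holds(n, k, alpha, theta):
    """Check V_{n,k} = (n - k alpha) V_{n+1,k} + V_{n+1,k+1} exactly."""
    lhs = v_pd(n, k, alpha, theta)
    rhs = (n - k * alpha) * v_pd(n + 1, k, alpha, theta) + v_pd(n + 1, k + 1, alpha, theta)
    return lhs == rhs


alpha, theta = Fraction(1, 2), Fraction(3, 2)
print(all(recursion_holds(n, k, alpha, theta)
          for n in range(1, 8) for k in range(1, n + 1)))
```

Passing `Fraction` arguments keeps every comparison exact; with floats the equality test would need a tolerance.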

The Bayesian nonparametric approach to species sampling problems under
$\alpha$-Gibbs priors was introduced in Lijoi et al. (2007, 2008) under the
following observational setup. Let $X_1, \ldots, X_n$ be a sample of observations
(labels) from an infinite population of species with unknown relative abundances
$(P_i)_{i \geq 1}$, which has yielded $k$ different species with multiplicities, in
order of appearance, $(n_1, \ldots, n_k)$, and let $X_{n+1}, \ldots, X_{n+m}$ be a
future sample of $m$ additional observations. Interest then lies in posterior
(conditional) inference on some characteristic of the unknown population of
abundances, or in predictive inference on some characteristic of a future sample.
Lijoi et al. (2007, 2008) provide the predictive distribution, under a general
$\alpha$-Gibbs prior, of various quantities of interest related to the additional
sample. In particular, for $K_m$, the number of new species, they obtain the
following result

$$P_{\alpha,V}(K_m = k^* \mid n_1, \ldots, n_k) = \frac{V_{n+m,k+k^*}}{V_{n,k}} \, S_{m,k^*}^{-1,-\alpha,-(n-k\alpha)} \tag{2}$$

where $k^*$ takes values in $0, \ldots, m$ and the $S_{m,k^*}^{-1,-\alpha,-(n-k\alpha)}$
are non-central generalized Stirling numbers. The most studied and tractable
$\alpha$-Gibbs model is the two-parameter Poisson-Dirichlet $(\alpha, \theta)$
model, defined for $\alpha \in (0, 1)$ and $\theta > -\alpha$, or for $\alpha < 0$
and $\theta = |\alpha|\xi$ with $\xi = 1, 2, \ldots$, introduced in Pitman (1995) and
extensively studied in Pitman and Yor (1997). It arises by mixing Poisson-Kingman
models driven by the stable subordinator (Pitman, 2003) with respect to a
polynomial tilting of the stable density, and its EPPF is given by

$$p_{\alpha,\theta}(n_1, \ldots, n_k) = \frac{(\theta + \alpha)_{k-1\uparrow\alpha}}{(\theta + 1)_{n-1}} \prod_{j=1}^{k} (1 - \alpha)_{n_j - 1}, \tag{3}$$

where $(x)_{y\uparrow\alpha}$ stands for generalized rising factorials,
$(x)_{y\uparrow\alpha} = x(x + \alpha) \cdots (x + (y-1)\alpha)$. A large body of
results is available for this model, mostly due to J. Pitman, and we rely heavily
on them in what follows. The posterior predictive distribution of the number of new
species in an additional $m$-sample follows from (2) by elementary combinatorial
calculus and corresponds to

$$P_{\alpha,\theta}(K_m = k^* \mid n_1, \ldots, n_k) = \frac{(\theta + k\alpha)_{k^*\uparrow\alpha}}{(\theta + n)_m} \, S_{m,k^*}^{-1,-\alpha,-(n-k\alpha)}.$$

Additional results on an explicit Bayesian estimator of predictive species richness
and on the asymptotic behaviour of a proper normalization of $K_m$ under
two-parameter Poisson-Dirichlet priors are in Favaro et al. (2009); see Cerquetti
(2011) for a generalization of the asymptotic result to general $\alpha$-Gibbs
partitions with $\alpha \in (0, 1)$.
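As an illustration, the law of $K_m$ under the two-parameter model can also be evaluated without generalized Stirling numbers, by iterating the one-step predictive rule: after $n$ observations showing $k$ species, a new species appears with probability $(\theta + k\alpha)/(\theta + n)$. The sketch below is a hedged alternative route to the same distribution, not the paper's derivation; `new_species_distribution` is a hypothetical name.

```python
from fractions import Fraction


def new_species_distribution(n, k, m, alpha, theta):
    """Exact law of K_m given (n, k) under PD(alpha, theta), by dynamic programming.

    Returns a dict mapping the number of new species to its probability.
    """
    alpha, theta = Fraction(alpha), Fraction(theta)
    dist = {0: Fraction(1)}            # before any additional draw: no new species
    for i in range(m):                 # i additional observations already drawn
        nxt = {}
        for j, p in dist.items():      # j new species discovered so far
            p_new = (theta + (k + j) * alpha) / (theta + n + i)
            nxt[j + 1] = nxt.get(j + 1, 0) + p * p_new
            nxt[j] = nxt.get(j, 0) + p * (1 - p_new)
        dist = nxt
    return dist


dist = new_species_distribution(n=10, k=3, m=5, alpha=Fraction(1, 2), theta=1)
for new_species, prob in sorted(dist.items()):
    print(new_species, float(prob))
```

The recursion works because, under this model, the probability of discovering a new species depends on the past only through the current sample size and the current number of species.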



In his seminal 1949 paper in Nature, Simpson introduced an index of diversity
("a measure of the concentration of the classification") for an infinite population
in which each individual belongs to one of $k$ different groups, with
$P_1, \ldots, P_k$ the proportions of individuals in the various groups. It is
defined as

$$H_S = 1 - \sum_{j=1}^{k} P_j^2,$$

namely the chance that two observations picked at random from the population belong
to different species. The index takes values between $0$ and $1 - 1/k$, the former
representing the smallest evenness (maximum concentration) and the latter the
situation of maximum diversity (minimum concentration); hence it depends on both
the richness and the evenness of the population. Here, dealing with priors on
discrete distributions over an infinite number of classes, we work with the
infinite-species version of Simpson's index, which varies in $(0, 1)$ and hence
becomes purely an index of evenness.
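For a known vector of proportions the index is immediate to compute. The following minimal sketch (the function name `simpson_index` is an illustrative assumption) shows the two extreme cases of the finite-species version: a single species gives $0$, and $k$ equally abundant species give $1 - 1/k$.

```python
def simpson_index(proportions, tol=1e-9):
    """Simpson's index H_S = 1 - sum_j p_j^2 for proportions summing to one."""
    if any(p < 0 for p in proportions):
        raise ValueError("proportions must be non-negative")
    if abs(sum(proportions) - 1.0) > tol:
        raise ValueError("proportions must sum to one")
    return 1.0 - sum(p * p for p in proportions)


print(simpson_index([1.0]))                  # one species: maximum concentration
print(simpson_index([0.25] * 4))             # four equally abundant species
print(simpson_index([0.7, 0.1, 0.1, 0.1]))   # same richness, less even
```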

2 Prior evenness in $\alpha$-Gibbs partition models

Apart from Fisher's model and mixtures of Fisher's models (Gibbs class for
$\alpha < 0$), choosing a nonparametric prior in the $\alpha$-Gibbs class implies
assuming that the number of different species in the population under study is
unknown and theoretically infinite. It follows that, if we want to introduce prior
knowledge on the diversity of the population, we cannot act on any kind of richness
parameter. Nevertheless, for each $\alpha$-Gibbs prior it is always possible to
identify an evenness parameter. Here we show how, by studying the prior mean of
Simpson's index.

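Proposition 1. [Prior mean and variance of Simpson's index under $\alpha$-Gibbs priors]
Let $(X_1, \ldots, X_n)$ be a sample from a population of species whose relative
abundances are distributed according to a general $\alpha$-Gibbs partition model.
Then, for $S_2 = \sum_{i=1}^{\infty} P_i^2$, the sequence of prior moments of
$1 - H_S$ is given by

$$E(S_2^{\ell}) = \sum_{j=1}^{\ell} V_{2\ell,j} \, \frac{1}{j!} \sum_{(\ell_1, \ldots, \ell_j)} \frac{\ell!}{\ell_1! \cdots \ell_j!} \prod_{i=1}^{j} (1 - \alpha)_{2\ell_i - 1}, \tag{4}$$

where the inner sum is over all sequences of $j$ positive integers with
$\ell_1 + \cdots + \ell_j = \ell$; hence a prior estimate of Simpson's index
corresponds to

$$E_{\alpha,V}(H_S) = 1 - E\Big(\sum_{i=1}^{\infty} P_i^2\Big) = 1 - E(S_2) = 1 - V_{2,1}(1 - \alpha) \tag{5}$$

and its prior variance is given by

$$Var(H_S) = Var(S_2) = E(S_2^2) - [E(S_2)]^2 = V_{4,1}(1 - \alpha)_3 + V_{4,2}(1 - \alpha)^2 - [V_{2,1}(1 - \alpha)]^2. \tag{6}$$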



Proof. Let $(P_i)_{i \geq 1}$ be the sequence of ranked atoms of a random discrete
distribution, and $\tilde{P}_j$ the random size of the $j$th atom discovered in the
process of random sampling or, equivalently, the asymptotic frequency of the $j$th
class when the blocks of the generated partition are put in order of their least
element. For the random variable

$$S_m := \sum_{i=1}^{\infty} P_i^m = \sum_{j=1}^{\infty} \tilde{P}_j^m,$$

where it is still assumed that $S_1 = 1$ almost surely, Pitman (2003) provides the
following general expression for the $\ell$th moment:

$$E(S_m^{\ell}) = \sum_{j=1}^{\ell} \frac{1}{j!} \sum_{(\ell_1, \ldots, \ell_j)} \frac{\ell!}{\ell_1! \cdots \ell_j!} \, p(m\ell_1, \ldots, m\ell_j), \tag{7}$$

where the second sum is over all sequences of $j$ positive integers
$(\ell_1, \ldots, \ell_j)$ with $\ell_1 + \cdots + \ell_j = \ell$. This implies that
the EPPF induced by sampling from a random discrete distribution directly
determines the positive integer moments of the power sums $S_m$, hence the
distribution of $S_m$, for each $m$. For $m = 2$, substituting the Gibbs form (1) in
(7), (4) follows. For $\ell = 1$ and $\ell = 2$, (5) and (6) are easily obtained. □

Example 2. [Two-parameter Poisson-Dirichlet model]. If $(P_j)_{j \geq 1}$ has the
two-parameter Poisson-Dirichlet distribution $(\alpha, \theta)$, for
$\alpha \in (0, 1)$ and $\theta > -\alpha$, then its size-biased permutation is
characterized by (Pitman, 1996)

$$\tilde{P}_j = W_j \prod_{i=1}^{j-1} (1 - W_i),$$

for $W_j$ independent Beta$(1 - \alpha, \theta + j\alpha)$. The prior mean of
Simpson's index hence corresponds to

$$E_{\alpha,\theta}(H_S) = \frac{\theta + \alpha}{1 + \theta},$$

which is the probability of observing a new species in the classical sequential
Chinese restaurant construction of the partition, given the first observation. It
follows that large values of $\theta$ and large values of $\alpha$ both increase the
prior belief in the evenness of the population. As for the prior variance,

$$Var_{\alpha,\theta}(H_S) = \frac{(1 - \alpha)_3 + (\theta + \alpha)(1 - \alpha)^2}{(\theta + 1)_3} - \frac{(1 - \alpha)^2}{(\theta + 1)^2}.$$
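The stick-breaking representation of Example 2 lends itself to a quick simulation check of the prior mean $(\theta+\alpha)/(1+\theta)$. The sketch below is only a rough illustration under stated assumptions: the infinite sequence is truncated at `n_sticks` atoms (an approximation that slightly overstates $H_S$), the names `simulate_simpson` and `monte_carlo_mean` are hypothetical, and the seed and sample sizes are arbitrary.

```python
import random


def simulate_simpson(alpha, theta, n_sticks=100, rng=random):
    """One draw of H_S = 1 - sum_j P_j^2 from a truncated PD(alpha, theta) stick-breaking."""
    remaining, sum_sq = 1.0, 0.0
    for j in range(1, n_sticks + 1):
        w = rng.betavariate(1.0 - alpha, theta + j * alpha)  # W_j ~ Beta(1-alpha, theta+j*alpha)
        p = remaining * w                                    # size-biased atom
        sum_sq += p * p
        remaining *= 1.0 - w
    return 1.0 - sum_sq


def monte_carlo_mean(alpha, theta, n_draws=2000, seed=0):
    """Average of independent simulated values of H_S."""
    rng = random.Random(seed)
    return sum(simulate_simpson(alpha, theta, rng=rng) for _ in range(n_draws)) / n_draws


alpha, theta = 0.5, 1.0
print("simulated mean :", monte_carlo_mean(alpha, theta))
print("closed form    :", (theta + alpha) / (1.0 + theta))
```

The two printed values should agree up to Monte Carlo error, which shrinks as `n_draws` grows.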

Example 3. [Fisher's model] The simplest and oldest species sampling model was
introduced in Fisher et al. (1943). It is characterized by a finite number $\xi$ of
species whose unknown proportions $(P_1, \ldots, P_\xi)$ have a symmetric Dirichlet
distribution with parameter $(|\alpha|, \ldots, |\alpha|)$, for
$\alpha \in (-\infty, 0)$, arising as $(G_1/G, \ldots, G_\xi/G)$ for $G_i$
independent Gamma$(|\alpha|, 1)$ random variables with $G = \sum_i G_i$. To identify
the evenness parameter it is enough to notice that the marginals $P_j$ are
Beta$(|\alpha|, (\xi - 1)|\alpha|)$ distributed, so that

$$E_{\alpha,\xi}(H_S) = 1 - \xi E(P_j^2) = 1 - \frac{1 + |\alpha|}{\xi|\alpha| + 1} = \frac{|\alpha|(\xi - 1)}{\xi|\alpha| + 1}.$$

Hence for large values of $|\alpha|$ fairly uniform $(P_j)$ are expected, while
small values of $|\alpha|$ correspond to widely different $P_j$. This shows that
$|\alpha|$ is indeed an evenness parameter. The explicit prior variance is as
follows:

$$Var_{\alpha,\xi}(H_S) = \frac{(1 + |\alpha|)_3 + (|\alpha|\xi - |\alpha|)(1 + |\alpha|)^2}{(|\alpha|\xi + 1)_3} - \frac{(1 + |\alpha|)^2}{(|\alpha|\xi + 1)^2}.$$

Example 4. [Dirichlet-Ewens model]. For $\alpha \to 0$, $\xi \to \infty$ and
$|\alpha|\xi \to \theta$, Fisher's model converges to the random atoms of the
Dirichlet probability measure (Ferguson, 1973), whose ranked atoms $(P_j)$ are
known to have the PD$(\theta)$ Poisson-Dirichlet distribution. Their size-biased
permutation has the GEM distribution

$$\tilde{P}_j = W_j \prod_{i=1}^{j-1} (1 - W_i)$$

for $W_i$ i.i.d. Beta$(1, \theta)$, hence

$$E_{\theta}(H_S) = \frac{\theta}{1 + \theta}.$$

This shows that $\theta$ is exactly a parameter of evenness, and that choosing a
Dirichlet prior over the unknown relative abundances allows one to explicitly
introduce a prior hypothesis on the evenness of the population. The prior variance
of $H_S$ corresponds to

$$Var_{\theta}(H_S) = \frac{6 + \theta}{(\theta + 1)_3} - \frac{1}{(\theta + 1)^2}.$$

Remark 5. Notice that mixing over $\theta$, as in Gnedin & Pitman (2006), with a
general mixing distribution over $(0, \infty)$ produces the class of exchangeable
Gibbs partitions of type $\alpha = 0$. In terms of random probability measures this
corresponds to Antoniak's (1974) mixtures of Dirichlet processes. Although this
class has received less attention in the Bayesian treatment of species sampling
problems, we notice that it makes it possible to explicitly introduce a prior on
the evenness parameter of the Dirichlet partition model.

Example 6. [$\alpha$-Stable model] From the results for the Poisson-Dirichlet
$(\alpha, \theta)$ model, the prior mean and variance of Simpson's index for the
random discrete distribution obtained by normalizing the ranked points of a Poisson
process with mean intensity the Lévy density of the $\alpha$-stable law, for
$\alpha \in (0, 1)$, easily follow by letting $\theta = 0$, since
$PK(\rho_\alpha) = PD(\alpha, 0)$. Hence

$$E_{\alpha}(H_S) = \alpha$$

and

$$Var_{\alpha}(H_S) = \frac{(1 - \alpha)_3 + \alpha(1 - \alpha)^2}{6} - (1 - \alpha)^2 = \frac{2\alpha(1 - \alpha)}{6}.$$

It follows that in this case too the value of $\alpha$ is a direct measure of the
evenness of the population and may be chosen according to prior information in a
Bayesian nonparametric perspective.
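The special cases of this section can be recovered from the two-parameter formulas with exact arithmetic. The sketch below is illustrative (the helper names `rising` and `prior_mean_var` and the chosen parameter values are assumptions): it evaluates the prior mean and variance of $H_S$ under PD$(\alpha, \theta)$ and specializes them to the Stable ($\theta = 0$), Dirichlet-Ewens ($\alpha = 0$) and Fisher ($\alpha < 0$, $\theta = \xi|\alpha|$) models.

```python
from fractions import Fraction


def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1)."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def prior_mean_var(alpha, theta):
    """Prior mean and variance of Simpson's index H_S under PD(alpha, theta)."""
    alpha, theta = Fraction(alpha), Fraction(theta)
    mean_s2 = (1 - alpha) / (theta + 1)                        # E(S_2)
    second = (rising(1 - alpha, 3)
              + (theta + alpha) * (1 - alpha) ** 2) / rising(theta + 1, 3)  # E(S_2^2)
    return 1 - mean_s2, second - mean_s2 ** 2


a = Fraction(1, 3)
stable = prior_mean_var(a, 0)                          # theta = 0
ewens = prior_mean_var(0, 2)                           # alpha = 0, theta = 2
fisher = prior_mean_var(Fraction(-1, 2), 4 * Fraction(1, 2))   # xi = 4, |alpha| = 1/2

print("stable :", stable, "closed form:", (a, a * (1 - a) / 3))
print("ewens  :", ewens)
print("fisher :", fisher)
```

In the Stable case the variance simplifies to $\alpha(1-\alpha)/3$, which is the same quantity as the $2\alpha(1-\alpha)/6$ reported in Example 6.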

3 Posterior estimation of species evenness under PD($\alpha$, $\theta$) priors

To obtain a Bayesian nonparametric estimator of Simpson's index of evenness we just
need the posterior distribution of the size-biased ordered atoms of the random
discrete distribution chosen as the prior model. An explicit result for the
two-parameter $(\alpha, \theta)$ model is in Pitman (1996). Notice that all the
examples treated in the previous section arise as particular cases of this model;
hence specific posterior estimators and their variances may be obtained from the
general result for this class.

Proposition 7. Let $(X_j)_{j \geq 1}$ be a population of exchangeable species labels
driven by a two-parameter Poisson-Dirichlet prior. Let $(n_1, \ldots, n_k)$ be the
multiplicities of the first $k$ species observed in a basic $n$-sample. Then the
Bayesian nonparametric estimate, under quadratic loss, of Simpson's index of
evenness $H_S$ is given by

$$E_{\alpha,\theta}(H_S \mid n_1, \ldots, n_k) = 1 - \frac{\sum_{j=1}^{k} (n_j - \alpha)_2}{(\theta + n)_2} - \frac{(\theta + k\alpha)(1 - \alpha)}{(\theta + n)_2}. \tag{8}$$

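Proof. Recall from Pitman (1996) that if $(X_n)$ is a sample from a random discrete
distribution $P$ with ranked atoms having PD$(\alpha, \theta)$ distribution, then,
conditionally on $(n_1, \ldots, n_k)$, the random partition of $n$ induced by the
first $k$ different values $X_1, \ldots, X_k$ of $(X_n)$, the posterior random
discrete distribution is given by

$$P_n(\cdot) = \sum_{j=1}^{k} P_{j,n} \, \delta_{X_j}(\cdot) + R_k P_k(\cdot), \tag{9}$$

where $(P_{1,n}, \ldots, P_{k,n}, R_k)$ has Dirichlet
$(n_1 - \alpha, \ldots, n_k - \alpha, \theta + k\alpha)$ distribution, independently
of the random discrete distribution $P_k(\cdot)$, which has ranked atoms
$(Q_i)_{i \geq 1}$ having PD$(\alpha, \theta + k\alpha)$ distribution. Now, by
independence,

$$E\Big(\sum_{j \geq 1} P_j^{*2} \,\Big|\, n_1, \ldots, n_k\Big) = \sum_{j=1}^{k} E(P_{j,n}^2) + E(R_k^2) \, E\Big(\sum_{i \geq 1} Q_i^2\Big),$$

and recalling that if $X \sim$ Beta$(a, b)$ then $E(X^2) = (a)_2/(a + b)_2$, since
$P_{j,n} \sim$ Beta$(n_j - \alpha, \theta + n - n_j + \alpha)$ and
$R_k \sim$ Beta$(\theta + k\alpha, n - k\alpha)$, then

$$E(P_{j,n}^2) = \frac{(n_j - \alpha)_2}{(\theta + n)_2}, \qquad E(R_k^2) = \frac{(\theta + k\alpha)_2}{(\theta + n)_2}.$$

Additionally, by the structural distribution of PD$(\alpha, \theta + k\alpha)$, the
first size-biased pick satisfies
$\tilde{Q}_1 \sim$ Beta$(1 - \alpha, \theta + k\alpha + \alpha)$, hence

$$E\Big(\sum_{i \geq 1} Q_i^2\Big) = E(\tilde{Q}_1) = \frac{1 - \alpha}{\theta + k\alpha + 1},$$

and the result is proved. □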
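The estimator (8) is straightforward to evaluate from the observed multiplicities. The following is a minimal sketch under the assumptions of Proposition 7; the function names `rising` and `posterior_mean_simpson` and the example counts are illustrative, not taken from the paper.

```python
from fractions import Fraction


def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1)."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def posterior_mean_simpson(counts, alpha, theta):
    """Posterior mean of H_S under PD(alpha, theta), given species multiplicities."""
    alpha, theta = Fraction(alpha), Fraction(theta)
    n, k = sum(counts), len(counts)
    observed = sum(rising(nj - alpha, 2) for nj in counts)   # contribution of observed species
    unobserved = (theta + k * alpha) * (1 - alpha)           # contribution of unseen species
    return 1 - (observed + unobserved) / rising(theta + n, 2)


counts = [5, 3, 1, 1]   # hypothetical multiplicities of k = 4 species, n = 10
print(float(posterior_mean_simpson(counts, Fraction(1, 2), 1)))
print(posterior_mean_simpson([], Fraction(1, 2), 1))   # no data: reduces to the prior mean
```

With an empty sample the expression reduces to $(\theta + \alpha)/(\theta + 1)$, the prior mean of Example 2, which is a convenient sanity check.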

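It is also possible to obtain the posterior variance of the Bayesian estimator of
Simpson's index under the two-parameter Poisson-Dirichlet model.

Proposition 8. Let $(X_j)_{j \geq 1}$ be a population of exchangeable species labels
driven by a two-parameter Poisson-Dirichlet prior. Let $(n_1, \ldots, n_k)$ be the
multiplicities of the first $k$ species observed in a basic $n$-sample. Then the
Bayesian nonparametric estimator of Proposition 7 is characterized by the following
posterior variance:

$$\begin{aligned}
Var_{\alpha,\theta}(H_S \mid n_1, \ldots, n_k) ={}& \frac{\sum_{j=1}^{k} (n_j - \alpha)_4}{(\theta + n)_4} + \frac{2 \sum_{i<j} (n_i - \alpha)_2 (n_j - \alpha)_2}{(\theta + n)_4} \\
&+ \frac{(\theta + k\alpha)\left[(1 - \alpha)_3 + (\theta + k\alpha + \alpha)(1 - \alpha)^2\right]}{(\theta + n)_4} \\
&+ \frac{2 \sum_{j=1}^{k} (n_j - \alpha)_2 \, (\theta + k\alpha)(1 - \alpha)}{(\theta + n)_4} \\
&- \left[\frac{\sum_{j=1}^{k} (n_j - \alpha)_2 + (\theta + k\alpha)(1 - \alpha)}{(\theta + n)_2}\right]^2.
\end{aligned} \tag{10}$$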



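Proof. It is enough to obtain the posterior second moment of $1 - H_S$. We can
always write

$$E\Big[\Big(\sum_{j=1}^{\infty} P_j^{*2}\Big)^2\Big] = E\Big[\Big(\sum_{j=1}^{k} P_{j,n}^2 + R_k^2 \sum_{i=1}^{\infty} Q_i^2\Big)^2\Big],$$

which, by independence, reduces to

$$\sum_{j=1}^{k} E(P_{j,n}^4) + 2 \sum_{i<j} E(P_{i,n}^2 P_{j,n}^2) + E(R_k^4) \, E\Big[\Big(\sum_{i=1}^{\infty} Q_i^2\Big)^2\Big] + 2 \sum_{j=1}^{k} E(P_{j,n}^2 R_k^2) \, E\Big(\sum_{i=1}^{\infty} Q_i^2\Big).$$

By known expressions for mixed moments of Dirichlet vectors, and since, from
Pitman's formula recalled in the proof of Proposition 1, with $V_{4,1}$ and
$V_{4,2}$ the weights of the PD$(\alpha, \theta + k\alpha)$ model,

$$E\Big[\Big(\sum_{i=1}^{\infty} Q_i^2\Big)^2\Big] = E(S_{2,\theta+k\alpha}^2) = V_{4,1}(1 - \alpha)_3 + V_{4,2}(1 - \alpha)^2 = \frac{(1 - \alpha)_3 + (\theta + k\alpha + \alpha)(1 - \alpha)^2}{(\theta + k\alpha + 1)_3},$$

it follows that

$$\begin{aligned}
E\left[(1 - H_S)^2 \mid n_1, \ldots, n_k\right] ={}& \frac{\sum_{j=1}^{k} (n_j - \alpha)_4}{(\theta + n)_4} + \frac{2 \sum_{i<j} (n_i - \alpha)_2 (n_j - \alpha)_2}{(\theta + n)_4} \\
&+ \frac{(\theta + k\alpha)\left[(1 - \alpha)_3 + (\theta + k\alpha + \alpha)(1 - \alpha)^2\right]}{(\theta + n)_4} + \frac{2 \sum_{j=1}^{k} (n_j - \alpha)_2 \, (\theta + k\alpha)(1 - \alpha)}{(\theta + n)_4}.
\end{aligned}$$

□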
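The posterior variance can be assembled term by term from the second moment derived in the proof. The sketch below is a hedged illustration of that computation, with hypothetical names (`rising`, `posterior_var_simpson`); the cross term is summed over unordered pairs $i < j$ and then doubled.

```python
from fractions import Fraction
from itertools import combinations


def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1)."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def posterior_var_simpson(counts, alpha, theta):
    """Posterior variance of H_S under PD(alpha, theta), given species multiplicities."""
    alpha, theta = Fraction(alpha), Fraction(theta)
    n, k = sum(counts), len(counts)
    t = theta + k * alpha                                    # updated parameter theta + k*alpha
    sq = [rising(nj - alpha, 2) for nj in counts]            # (n_j - alpha)_2
    second = (
        sum(rising(nj - alpha, 4) for nj in counts)          # fourth-moment terms
        + 2 * sum(a * b for a, b in combinations(sq, 2))     # cross terms over pairs i < j
        + t * (rising(1 - alpha, 3) + (t + alpha) * (1 - alpha) ** 2)
        + 2 * t * (1 - alpha) * sum(sq)
    ) / rising(theta + n, 4)                                 # E[(1 - H_S)^2 | data]
    first = (sum(sq) + t * (1 - alpha)) / rising(theta + n, 2)   # E[1 - H_S | data]
    return second - first ** 2


print(posterior_var_simpson([2, 1], 0, 1))
print(posterior_var_simpson([], Fraction(1, 2), 1))   # no data: prior variance of Example 2
```

As with the mean, an empty sample returns the prior variance of the two-parameter model, which gives a simple consistency check between Sections 2 and 3.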
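Example 9. [Fisher's model (continued)] The Bayesian nonparametric estimator of
Simpson's index of evenness under the $(\alpha, \xi|\alpha|)$ Fisher's model is
given by

$$E_{\alpha,\xi}(H_S \mid n_1, \ldots, n_k) = 1 - \frac{\sum_{j=1}^{k} (n_j + |\alpha|)_2 + (|\alpha|\xi - k|\alpha|)(1 + |\alpha|)}{(|\alpha|\xi + n)_2}$$

and its posterior variance corresponds to

$$\begin{aligned}
Var(H_S \mid n_1, \ldots, n_k) ={}& \frac{\sum_{j=1}^{k} (n_j + |\alpha|)_4}{(|\alpha|\xi + n)_4} + \frac{2 \sum_{i<j} (n_i + |\alpha|)_2 (n_j + |\alpha|)_2}{(|\alpha|\xi + n)_4} \\
&+ \frac{(|\alpha|\xi - k|\alpha|)\left[(1 + |\alpha|)_3 + (|\alpha|\xi - k|\alpha| - |\alpha|)(1 + |\alpha|)^2\right]}{(|\alpha|\xi + n)_4} \\
&+ \frac{2 \sum_{j=1}^{k} (n_j + |\alpha|)_2 \, (|\alpha|\xi - k|\alpha|)(1 + |\alpha|)}{(|\alpha|\xi + n)_4} \\
&- \left[\frac{\sum_{j=1}^{k} (n_j + |\alpha|)_2 + (|\alpha|\xi - k|\alpha|)(1 + |\alpha|)}{(|\alpha|\xi + n)_2}\right]^2.
\end{aligned}$$
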



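Example 10. [Dirichlet-Ewens model (continued)] The Bayesian nonparametric
estimator of Simpson's index arises by specializing (8) to $\alpha = 0$, hence

$$E_{\theta}(H_S \mid n_1, \ldots, n_k) = 1 - \frac{\sum_{j=1}^{k} (n_j)_2 + \theta}{(\theta + n)_2}.$$

Similarly, for the posterior variance of $H_S$ we obtain

$$Var_{\theta}(H_S \mid n_1, \ldots, n_k) = \frac{\sum_{j=1}^{k} (n_j)_4 + 2 \sum_{i<j} (n_i)_2 (n_j)_2 + \theta\left[(1)_3 + \theta\right] + 2\theta \sum_{j=1}^{k} (n_j)_2}{(\theta + n)_4} - \left[\frac{\sum_{j=1}^{k} (n_j)_2 + \theta}{(\theta + n)_2}\right]^2.$$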



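Example 11. [$\alpha$-Stable model (continued)] The posterior mean and variance of
Simpson's index under the $\alpha$-Stable model, for $\alpha \in (0, 1)$, arise from
the general formulas for $\theta = 0$, hence

$$E_{\alpha}(H_S \mid n_1, \ldots, n_k) = 1 - \frac{\sum_{j=1}^{k} (n_j - \alpha)_2}{(n)_2} - \frac{k\alpha(1 - \alpha)}{(n)_2}$$

and

$$\begin{aligned}
Var(H_S \mid n_1, \ldots, n_k) ={}& \frac{\sum_{j=1}^{k} (n_j - \alpha)_4}{(n)_4} + \frac{2 \sum_{i<j} (n_i - \alpha)_2 (n_j - \alpha)_2}{(n)_4} + \frac{k\alpha\left[(1 - \alpha)_3 + (k\alpha + \alpha)(1 - \alpha)^2\right]}{(n)_4} \\
&+ \frac{2 \sum_{j=1}^{k} (n_j - \alpha)_2 \, k\alpha(1 - \alpha)}{(n)_4} - \left[\frac{\sum_{j=1}^{k} (n_j - \alpha)_2 + k\alpha(1 - \alpha)}{(n)_2}\right]^2.
\end{aligned}$$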

4 Future directions

A generalization of Pitman's posterior result (9) for the two-parameter model to
random probability measures obtained by normalization of measures with independent
increments is in James et al. (2009). Additionally, an explicit stick-breaking
construction for the size-biased atoms of the normalized Inverse Gaussian partition
model, which belongs to the $\alpha$-Gibbs class for $\alpha = 1/2$ with mixing
distribution the exponentially tilted version of the 1/2-stable distribution (cf.
Pitman, 2003), has recently been obtained in Favaro et al. (2012). These results
suggest it may be worth investigating the possibility of obtaining an explicit
Bayesian nonparametric estimator of Simpson's index of evenness under this prior
model for relative abundances. This will be the subject of future investigation.